On the Finite Time Convergence of Cyclic Coordinate Descent 

Methods 



Ankan Saha 

Department of Computer Science 
University of Chicago 

ankans@cs.uchicago.edu 



Ambuj Tewari 

Toyota Technological Institute 
Chicago, USA 

tewari@ttic . edu 



o 



(N 



a 

o 



> 

o 
o 



X 



Abstract 

Cyclic coordinate descent is a classic optimization method that has witnessed a resurgence of in- 
terest in machine learning. Reasons for this include its simplicity, speed and stability, as well as its 
competitive performance on £i regularized smooth optimization problems. Surprisingly, very little 
is known about its finite time convergence behavior on these problems. Most existing results either 
just prove convergence or provide asymptotic rates. We fill this gap in the literature by proving 
0(1/ k) convergence rates (where k is the iteration counter) for two variants of cyclic coordinate 
descent under an isotonicity assumption. Our analysis proceeds by comparing the objective values 
attained by the two variants with each other, as well as with the gradient descent algorithm. We 
show that the iterates generated by the cyclic coordinate descent methods remain better than those 
of gradient descent uniformly over time. 

1 Introduction 

The dominant paradigm in Machine Learning currently is to cast learning problems as optimization problems. 
This is clearly borne out by approaches involving empirical risk minimization, maximum likelihood, maximum 
entropy, minimum description length, etc. As machine learning faces ever increasing and high-dimensional 
datasets, we are faced with novel challenges in designing and analyzing optimization algorithms that can 
adapt efficiently to such datasets. A mini-revolution of sorts is taking place where algorithms that were 
"slow" or "old" from a purely optimization point of view are witnessing a resurgence of interest. This paper 
considers one such family of algorithms, namely the coordinate descent methods. There has been recent 
work demonstrating the potential of these algorithms for solving ^i-regularized loss minimization problems: 



n 



(1) 



where x is possibly high dimensional predictor that is being learned from the samples Zi = {Xi, Yi) con- 
sisting of input, output pairs, £ is a convex loss function measuring prediction performance, and A > is 
a "regularization" parameter. The use of the £i norm ||a;||i (sum of absolute values of Xi) as a "penalty" 
or "regularization term" is motivated by its sparsity promoting properties and there is a large and growing 
literature studying such issues (see, e.g., [11] and references therein). In this paper, we restrict ourselves to 
analyzing the behavior of coordinate descent methods on problems like (1) above. The general idea behind 
coordinate descent is to choose, at each iteration, an index j and change Xj such that objective F decreases. 
Choosing j can be as simple as cycling through the coordinates or a more sophisticated coordinate selection 
rule can be employed. [5] use the cyclic rule which we analyze in this paper. 

Our emphasis is on obtaining ^«/re time rates, i.e. guarantees about accuracy of iterative optimization 
algorithms that hold right from the first iteration. This is in contrast to asymptotic guarantees that only hold 
once the iteration count is "large enough" (and often, what is meant by "large enough", is left unspecified). 
We feel such an emphasis is in the spirit of Learning Theory that has distinguished itself by regarding finite 
sample generalization bounds as important. For our analysis, we abstract away the particulars of the setting 
above, and view ( 1 ) as a special case of the convex optimization problem: 



F{x):^fix) + X\\xh 



(2) 



In order to obtain finite time convergence rates, one must assume that / is "nice" is some sense. This 
can be quantified in different ways including assumptions of Lipschitz continuity, differentiabihty or strong 



convexity. We will assume that / is differentiable with a Lipschitz continuous gradient. In the context of 
problem (1), it amounts to assuming that the loss i is differentiable. Many losses, such as squared loss and 
logistic loss, are differentiable. Our results therefore apply to i^ regularized squared loss ("Lasso") and to i^ 
regularized logistic regression. 

For a method as old as cyclic coordinate descent, it is surprising that little is known about finite time 
convergence even under smoothness assumptions. As far as we know, finite time results are not available even 
when A = 0. i.e. for unconstrained smooth convex minimization problem. Given recent empirical successes 
of the method, we feel that this gap in the literature needs to be filled urgently. In fact, this sentiment is shared 
in ([ 15]) by the authors who lamented, "Better understanding of the convergence properties of the algorithms 
is sorely needed." They were talking about greedy coordinate descent methods but their comment applies to 
cyclic methods as well. 

The situation with gradient descent methods is much better. There are a variety of finite time convergence 
results available in the literature ([8]). Our strategy in this paper is to leverage these results to shed some 
light on the convergence of coordinate descent methods. We do this via a series of comparison theorems that 
relate variants of coordinate descent methods to each other and to the gradient descent algorithm. To do this, 
we make assumptions both on the starting point and an additional isotonicity assumption on the gradient of 
the function /. Since finite time 0{l/k) accuracy guarantees are available for gradient descent, we are able 
to prove the same rates for two variants of cyclic coordinate descent. Here k is the iteration count and the 
constants hidden in the 0() notation are small and known. We feel it should be possible to relax, or even 
eliminate, the additional assumptions we make (these are detailed in section 4) and doing this is an important 
open problem left for future work. 

We find it important to state at the outset that our aim here is not to give the best possible rates for the 
problem (2). For example, even among gradient-based methods, faster 0(1/ k^) finite time accuracy bounds 
can be achieved using Nesterov's celebrated 1983 method ([7]) or its later variants. Instead, our goal is to 
better understand cyclic coordinate descent methods and their relationship to gradient descent. 

Related Work Coordinate descent methods are quite old and we cannot attempt a survey here. Instead, 
we refer the reader to [12] and [14] that summarize previous work and also present analyses for coordinate 
descent methods. These consider cyclic coordinate descent as well as versions that use more sophisticated 
coordinate selection rules. However, as mentioned above, the analyses either establish convergence without 
rates or give asymptotic rates that hold after sufficiently many iterations have occurred. An exception is 
[13] that does give finite time rates but for a version of coordinate descent that is not cyclic. Finite time 
guarantees for a greedy version (choosing j to be the coordinate of the current gradient with the maximum 
value) also appear in [3]. The author essentially considers minimizing a smooth convex function over the 
probability simplex and also surveys previous work on greedy coordinate descent in that setting. For finite 
time (expected) accuracy bounds for stochastic coordinate descent (choose j uniformly at random) for ii 
regularization, see [10]. 

We mentioned that the empirical success reported in [5] was our motivation to consider cyclic coordinate 
descent for ii regularized problems. They consider the Lasso problem; 

min -l||Xa;-rf + A||.t||i, (3) 

where X G M"^'^ and Y e R". In this case, the smooth part / is a quadratic 

fix)^^{Ax,x) + {b,x) (4) 

where A — X^X and b — —XJY. Note that A is symmetric and positive semidefinite. Cyclic coordinate 
descent has also been applied to the f i-regularized logistic regression problem [6]. Since the logistic loss is 
differentiable, this problem also falls into the framework of this paper 

Outline Notation and necessary definitions are given in section 2. The gradient descent algorithm along 
with two variants of cyclic coordinate descent are presented in section 3. Section 4 spells out the additional 
assumptions on / that our current analysis needs. It also proves results comparing the iterates generated by the 
three algorithms considered in the paper when they are all started from the same point. Similar comparison 
theorems in the context of solving a system of non-linear equations using Jacobi and Gauss-Seidel methods 
appear in [9]. The results in section 4 set the stage for the main results given in section 5. This section 
converts the comparison between iterates into a comparison between objective function values achieved by 
the iterates. The finite time convergence rates of cyclic coordinate descent are then inferred from rates for 
gradient descent. There are plenty of issues that are still unresolved. Section 6 discusses some of them and 
provides a conclusion. 



2 Preliminaries and Notation 

We use the lowercase letters x, y, z, g and 7 to refer to vectors throughout the paper Normally paren- 
thesized superscripts, like x^'"'^ refer to vectors as well, whereas subscripts refer to the components of the 
corresponding vectors. For any positive integer fc, [k] := {l,...,fc}. sign(a) is the interval-valued sign 

function, i.e. sign(a) = {l}or{ — 1} corresponding to a > or a < 0. For a = 0, sign(a) = [—1,1]. 

1 
Unless otherwise specified, || • || refers to the Euclidean norm ||x|| :— {^i ^1) ^ ^ II ' II 1 will denote the li 
norm, ||a;||-^ = {J^i \xi\), (•, •) denotes the Euclidean dot product {x, y) = J^i ^iVi- Through out the paper 
inequalities between vectors are to be interpreted component wise i.e. x > y means that Xi > yi for all 
i G [d]. The following definition will be used extensively in the paper: 

Definition 1 Suppose a function f : M."^ ^i- M. is differentiable on M''. Then f is said to have Lipschitz 
continuous gradient (Leg) with respect to a norm \\ ■ \\ if there exists a constant L such that 

\\Vf{x)-Vf{x')\\<L\\x-x'\\ Va;,a;'eM''. (5) 

An important fact (see, e.g., [8, Thm. 2.1.5]) we will use is that if a function / has Lipschitz continuous 
gradient with respect to a norm || • ||, then it satisfies the following generalized bounded Hessian property 

I{x) < fix') + {Wf{x'),x- x') + §lk - x'f. (6) 

An operator T : M'' ^' M is said to be isotone iff 

x>y ^ T{x) > T{y). (7) 

An important isotone operator that we wiU frequently deal with is the shrinkage operator St : M"^' ^> M 
defined, for r > 0, as 

[Sr(x)l -^ Sr{x,) (8) 

where Sr{a) is the scalar shrinkage operator: 

( a T a > T 

(9) 




3 Algorithms 

We will consider three iterative algorithms for solving the minimization problem (2). All of them enjoy the 
descent property: F{x^^^^') < F(a;*^'^') for successive iterates a;^'^' and x'^'^+^-'. 

Algoritlim 1: Gradient Descent (GD) 

Initialize: Choose an appropriate initial point x^^\ 
for fc = 0, 1, . . . do 

x^^^+^)<^S^^i^{x(^^~^I^) 
end for 

Algorithm 1, known as Gradient Descent (GD), is one of the most common iterative algorithms used for 
convex optimization (See [1], [4] and references therein). It is based on the idea that using corollary (6) to 
generate a linear approximation of / at the current iterate x^''\ we can come up with the following global 
upper approximation of F: 

Fix) < /(xf'^)) + (v/(a;W), x - x^''^) + "^H^^ " ^^''^^f + ^H^^Hi ■ 

It is easy to show that the above approximation is minimized at a; = S^/l (x*-'"'' — Vf{x^''^)/L) ([1]). This is 
the next iterate for the GD algorithm. We call it "Gradient Descent" as it reduces to the following algorithm 

^(.+1) = ,(.) _ v/(^'^-^) 

when there is no regularization (i.e. A = 0). Finite time convergence rate for the GD algorithm are well 
known. 



Algorithm 2: Cyclic Coordinate Descent (CCD) 



Initialize: Choose an appropriate initial point y^'^\ 
for fc = 0, 1, . . . do 

y(fc,0) ^ y(fc) 

for j = 1 to d do 



end for 
end for 



yik+l) 4_ y(fc,d) 



Theorem 2 Lef jx'^'^' j be a sequence generated by the GD algorithm. Then, for any minimizer x* of (2), 
and\fk > 1, 



^,.»,_^,..,<*:^£i!^ 



2fc 

The above theorem can be found in, e.g., [1, Thm. 3.1]. 

The second algorithm. Cyclic Coordinate Descent (CCD), instead of using the current gradient to update 
all components simultaneously, goes through them in a cyclic fashion. The next "outer" iterate j/^^+i) is 
obtained from y'^'"'^ by creating a series of d intermediate or "inner" iterates y''^'^\ j G [d], where y'-^'i'^ 
differs from y^*^ J^^) only in the jth coordinate whose value can be found by minimizing the following one- 
dimensional over-approximation of F over the scalar a: 

/(y(^-'^-i)) + aJ: |yf '^-^)| + [Vf{y^''=-n^ ■ (a - ^^^-'^) + |(a - /'^'^~'^f^ + X\a\ . (10) 
It can again be verified that the above minimization has the closed form solution 



S\/L f yj 



which is what CCD chooses y^ ' to be. Once all coordinates have been cycled through, y'^'^+^i is simply set 



'^ L 

I- '■'' to be. Once all coordinates have been cycled through, y^'^^ 
to be ?/('='''). Let us point out that in an actual implementation, the inner iterates j/C^J) would not be computed 
separately but y'^'^' would be updated "in place". For analysis purposes, it is convenient to give names to the 
intermediate iterates. Note that for all j' G {0, 1, . . . , d}, the inner iterate looks like 



yi^.) 



i/i T--7!Jj ,yj+i,---,yd 



In the CCD algorithm updating the jth coordinate uses the newer gradient value V f{y'^^'^ ^^) rather 
than V/(y'^'''^) which is used in GD. This makes CCD inherently sequential. In contrast, different coordinate 
updates in GD can easily be done by different processors in parallel. However, on a single processor, we 
might hope CCD converges faster than GD due to the use of "fresh" information. Therefore, it is natural to 
expect that CCD should enjoy the finite time convergence rate given in Theorem 2 ( or better). We show this 
is indeed the case under an isotonicity assumption stated in Section 4 below. Under the assumption, we are 
actually able to show the correctness of the intuition that CCD should converge faster than GD. 

The third and final algorithm that we consider is Cyclic Coordinate Minimization (CCM). The only way 
it differs from CCD is that instead of minimizing the one-dimensional over-approximation (10), it chooses 



z to mmimize. 



P,'Jfej-l) Jk,j-1) (fc,i-l) JfcJ-l)^ 



i-1 i"7-^j+i 1 ■ ■ ■ T^d 

over a. In a sense, CCM is not actually an algorithm as it does not specify how to minimize F for any 
arbitrary smooth function /. An important case when the minimum can be computed exactly is when / is 
quadratic as in (4). In that case, we have 



^j'^' =Sx/A,„[Zj ^ 



3,0 



If there is no closed form solution, then we might have to resort to numerical minimization in order to 
implement CCM. This is usually not a problem since one-dimensional convex functions can be minimized 



Algorithm 3: Cyclic Coordinate Minimization 



Initialize: Choose an appropriate initial point z^°^ 

for fc = 0, 1, . . . do 



zikfi) ^ ^(k) 




for j = 


1 to d do 




^3 


^ argmin^ 


F{z\'^^ 


V^T^ 


J, zf^^' ^ z 


{k,j-l) 


end for 






^(fe+i) ^ ^(fc,d) 




end for 







Ak,j~l) (fcj-l) JfcJ-l)^ 



numerically to an extremely high degree of accuracy in a few steps. For the purpose of analysis, we will 
assume that an exact minimum is found. Again, intuition suggests that the accuracy of CCM after any fixed 
number of iterations should be better than that of CCD since CCD only minimizes an over-approximation. 
Under the same isotonicity assumption that we mentioned above, we can show that this intuition is indeed 
correct. 

We end this section with a cautionary remark regarding terminology. In the literature, CCM appears 
much more frequently than CCD and it is actually the former that is often referred to as "Cyclic Coordinate 
Descent" (See [5] and references therein). Our reasons for considering CCD are: (i) it is a nice, efficient 
alternative to CCM, and (ii) a stochastic version of CCD(where the coordinate to update is chosen randomly 
and not cyclically) is already known to enjoy finite time 0{l/k) expected convergence rate ([10]). 

4 Analysis 

We already mentioned the known convergence rate for GD (Theorem 2) above. Before delving into the 
analysis, it is necessary to state an assumption on / which accompanied by appropriate starting conditions 
results in particularly interesting properties of the convergence behavior of GD, as described in lemma 7. The 
GD algorithm generates iterates by applying the operator 

Tgd{x) ■.= ^x/l\x — I (11) 

repeatedly. It turns out that if Tgd is an isotone operator then the GD iterates satisfy lemma 7 which is 
essential for our convergence analysis. The above operator is a composition of ^\/l, an isotone operator, 
and I — '^ f /L (where I denotes the identity operator). To ensure overall isotonicity, it suffices to assume that 
I — V//-L is isotone. This is formally stated as: 

Assumption 3 The operator x ^-^ x ^-^ is isotone. 

Similar assumptions appear in the literature comparing Jacobi and Gauss-Seidel methods for solving 
linear equations [2, Chap. 2]. When the function / is quadratic as in (4), our assumption is equivalent to 
assuming that the off-diagonal entries in A are non-positive, i.e. Ai_j < for all i ^ j. For a general smooth 
/, the following condition is sufficient to make the assumption true: / is twice-differentiable and the Hessian 
V^/(a;) at any point x has non-positive off-diagonal entries. 

In the next few subsections, we will see how the isotonicity assumption leads to an isotonically decreasing 
(or increasing) behavior of GD, CCD and CCM iterates under appropriate starting conditions. To specify 
what these starting conditions are, we need the notions of super- and subsolutions. 

Definition 4 A vector x is a supersolution iff x > S\{x — Vf{x)). Analogously, x is a subsolution iff 

x<^x{x- V/(a;)). 

Since the inequalities above are vector inequalities, an arbitrary x may neither be a supersolution nor a 
subsolution. The names "supersolution" and "subsolution" are justified because equality holds in the defini- 
tions above, i.e. x = Sa (x — V/(x)) iff a; is a minimizer of F. To see this, note that subgradient optimaUty 
conditions say that a; is a minimizer of i^ = / + A|| • || i iff for all j e \d\ 

e [V/(a;)]j + Asign(a;j) . (12) 

Further, it is easy to see that, 

Va,6eM, r>0, Oe6 + Asign(a) ^ a ^ Sx/ria ~ b/r) (13) 



We prove a couple of properties of super- and subsolutions that will prove useful later The first property refers 
to the scale invariance of the definition of super- and subsolutions and the second property is the monotonicity 
of a single variable function. 



Lemma 5 If for any t > 0, 



x>Sy./Ax-^I^] (14) 



then X is a supersolution. If x is a supersolution then the above inequality holds for all t > 0. 
Similarly, if for any r > 0, 

then X is a subsolution. Ifx is a subsolution then the above inequality holds for all r > 0. 
Proof: See Appendix B 

Lemma 6 Ifx is a supersolution (resp. subsolution) then for any j, the function 



^5,,(.,-E4ilk) 



is monotonically nondecreasing (resp. nonincreasing). 

Proof: See Appendix C ■ 

4.1 Gradient Descent 

Lemma 7 If x^'^' is a supersolution and \^x^^> } is the sequence of iterates generated by the GD algorithm 
then \fk>0, 

1) x^''+^^ < x'^''^ 2) x'^''^ is a supersolution 

Ifx^'^' is a subsolution and ja;^'^' } is the sequence of iterates generated by the GD algorithm then V/c > 0, 

1) a;(''+^) > a;('=) 2) x^'''^ is a subsolution 

Proof: We only prove the supersolution case. The proof for the subsolution case is analogous. We start with 
a supersolution a;("\ Consider the operator 

Tgd{x) := Sx/L [x- 



L 

given by (1 1). By the isotonicity assumption, Tgd is an isotone operator We will prove by induction that 
Tgd{x'^'^'') < x^"). This proves that x(''+^) < x^*-'' since a;(''+^' = Tgd{x'-'^^). Using lemma 5, the second 
claim follows by the definition of the Tqd operator 

The base case ToDix^^'^) < x^^^^ is true by Lemma 5 since x^^'' is given to be a supersolution. Now 
assume Tgd{x'--''^) < x''^\ Applying the isotone operator Tgd on both sides we get Tgd{Tgd{x^^'')) < 
Tgd{x'^^^)- Thisis thesameasTG£)(a;'-'""''^^) < x^^^^^ by definition of a;('^+^^ which completes our inductive 
claim. 



4.2 Cyclic Coordinate Descent (CCD) 

Lemma 8 Ify^^' is a supersolution and {y^''' } is the sequence of iterates generated by the CCD algorithm 
then \/k > 0, 

1) y'"'"'^^^ < y^*"^ 2) y^'') is a supersolution 

Ifyo is a subsolution and {y''"' } is the sequence of iterates generated by the CCD algorithm then Vfc > 0, 

1) y^''^^^ > y^""^ 2) y^'''^ is a subsolution 



Proof: We will only prove the supersolution case as the subsolution proof is analogous. We start with a 
supersolution y'^^\ We will prove the following: If y^'^^ is a supersolution then, 

jyC^+i) < yC^) , (15) 

y(fc+i) jg ^ supersolution (16) 

Then the lemma follows by induction on k. Let us make the induction assumption that y'--''^ is a supersolution 
and try to prove (15) and (16). To prove these, we will show that yt'^'-') < yC^') and yC^-J) is a supersolution 
by induction on j G {0, 1, . . . ,d}. This proves (15) and (16) for j/('^+^^ since y^'^+i) = yi^^^.) , 

For the base case (j = 0) of the induction, note that y^^^^^ < yC^' is trivial since the two vectors are 
equal. For the same reason, yC^'") is a supersolution since we have assumed y^'^-' to be a supersolution. Now 
assume y^'''i~^'> < y^'^) and yC^'J^^) is a supersolution for some j > 0. We want to show that yC^'-J) < yC^) 
and yC^ J) is a supersolution. 

Since yC^J^^) and yC^'-J) differ only in the jth coordinate, to show that yC^^^) < y^*^) given y('^J^i) < 
y'^'^^ it suffices to show that y('^^J) < yC^J'^i)^ i.e. 

(k.j) , (fej"-l) (fc) .ns 

Since yC^J"!) < yC^) applying the isotone operator I — V f /L on both sides and taking the jth coordinate 

gives. 






y-j j^ — y. 

Applying the scalar shrinkage operator on both sides gives, 

i>\/L [Vj ^ j < i>\/L [Vj ^- j < Vj 

The left hand side is y^ '•'' by definition while the second inequality follows because y^*^^ is a supersolution. 
Thus, we have proved (17). 

Now we prove that yC^'-J) is a supersolution. Note that we have akeady shown y^*^ J) < yC^' J^i) . Applying 
the isotone operator I j^ on both sides gives, 

yj j^ — yj j^ , yi^o) 

V. ^ ,■ yf '^■) - [^/(^^''^"^)]' < yf -^-1) _ i^fjy^'''-'^)]^ . (19) 

Applying a scalar shrinkage on both sides of (18) gives, 

bx/L [Vj J^ j < i>X/L [Vj J^ 

(k i) 

Since the right hand side is y- by definition, we have. 



Vi -yi > ^xjL \y^ ^- — 



For i ^ i,v^t have 



(21) 

The first inequality above is true because y('^J^i) is a supersolution (by Induction Assumption) (and Lemma 5). 
The second follows from (19) by applying a scalar shrinkage on both sides. Combining (20) and (21), we get 

which proves, using Lemma 5, that yC^' J) is a supersolution. 



4.3 Comparison: GD vs. CCD 

Theorem 9 Suppose {x^^' } and {y^^' } are the sequences of iterates generated by the GD and CCD algo- 
rithms respectively when started from the same supersolution x^^' — y^^'. Then, V/c > 0, 

On the other hand, if they are started from the same subsolution x^^' — y^^> then the sequences satisfy, 
Vfc>0, 

Proof: We will prove lemma 9 only for the supersolution case by induction on k. The base case is trivial 
since j/*^"-' = x^^\ Now assume y^''' < x^'^' and we will prove yC'+i) < x^''^^' . Fix a j G [d]. Note that we 
have, 

Vj - Vj - ^\/L [Vj -^ 

By Lemma 8, yC^J"!) < y'^^\ Applying the isotone operator 5^/^ ° (I — ^ f /L) on both sides and taking 
the jth coordinate gives, 

i^X/L [Vj J^ j < i=X/L [Vj ^ 

Combining this with the previous equation gives, 

^-^ < S,,. if - Mpk) . (22) 

Since y^'"'^ < x'^'"-' by induction hypothesis, applying the isotone operator S\/l ° (I ^ ^//^) on both sides 
and taking the jth coordinate gives, 



'A/L [yy Y j S ^X/L [ Xj 

By definition, 

xr'^s„.(.r-M|i^). (23) 

Combining this with the previous inequality and (22) gives, 

(fc+i) ^ (fe+i) 

y) ^ ^i ■ 

Since j was arbitrary this means i/C'+i) < x'^^^'^^ and the proof is complete. ■ 

4.4 Cyclic Coordinate Minimization (CCM) 

Since CCM minimizes a one-dimensional restriction of the function F, let us define some notation for this 
subsection. Let, 

f\j{a;x) := f{xi, . . . ,Xj-i,a,Xj+i, . . . ,Xd) 
F\j{a\x) := F{xi,...,Xj^i,a,Xj+i,...,Xd) ■ 

With this notation, CCM update can be written as: 



z) = argmm 



in F|(a;z('='^-i)) (24) 



\fz ^ J, zf '^) = zf '^-^) . 

In order to avoid dealing with infinities in our analysis, we want to ensure that the minimum in (24) above is 
attained at a finite real number This leads to the following assumption. 

Assumption 10 For any x G W^' and any j G [d], the one-variable function f\j {a; x) (and hence Fy (a; x)) 
is strictly convex. 



This is a pretty mild assumption: considerably weaker than assuming, for instance, that the function / 
itself is strictly convex. For example, when / is quadratic as in (4), then the above assumption is equivalent 
to saying that the diagonal entries Ajj of the positive semi definite matrix A are all strictly positive. This is 
much weaker than saying that / is strictly convex (which would mean A is invertible). 

The next lemma shows that the CCM update can be represented in a way that makes it quite similar to 
the CCD update. 

Lemma 11 Fix k > 0,j e [d] and consider the CCM update (24). Let g{a) = f\j{a;z'^^'^^^^). If the 

(k i) / (k 7 — 1) 

update is non-trivial, i.e. z- ^ z- , it can be written as 



zf"-' = 5./. #-^'^^ 



[V/(z('=.J--i))], 



for 



(fej) _ (k,j~i)) 



(25) 



Furthermore, we have < t < L. 



Proof: See Appendix A ■ 

We point out that this lemma is useful only for the analysis of CCM and not for its implementation (as 
r depends recursively on z ' ) except in an important special case. In the quadratic example (4), g{a) is a 

one-dimensional quadratic function. In this case t does not depend on z ■ '"' and is simply Ajj. This leads 
to an efficient implementation of CCM for quadratic /. 

We are now equipped with everything to prove the following behavior of the CCM iterates. 

Lemma 12 If zq is a supersolution and {z^'^'j is the sequence of iterates generated by the CCM algorithm 
then yk>0, 

1) z'^''+^) < z'-''^ 2) z*^*') is a supersolution 

If Zq is a subsolution and {z^^' } is the sequence of iterates generated by the CCD algorithm then Vfc > 0, 

1) z^^+i) > 2^=) 2) zC^) is a subsolution 

Proof: Again, we will only prove the supersolution case as the subsolution case is analogous. We are given 
that z'") is a supersolution. We will prove the following: if z^*^^ is a supersolution then, 

zC^+i) < z^'^) , (26) 

^.(fe+i) jg ^ supersolution . (27) 

Then the lemma follows by induction on k. Let us assume that z^''^ is a supersolution and try to prove (26) 
and (27). To prove these we will show that z^*^'-'' < z^*^' and z'-'^'^^ is a supersolution by induction on 
j e {0, 1, . . . , d}. This proves (26) and (27) for z^^+i) since zt'^+i) = z^^^'^l . 

The base case (j = 0) of the induction is trivial since z''^'"' < z^'^^ since the two vectors are equal. For 
the same reason, z^'^'^^ is a supersolution since we have assumed z^''^ to be a supersolution. Now assume 
2;('=J~i) < z^'^^ and z^'^'^"-^^ is a supersolution for some j > 0. We want to show that z''^''-'^ < z''''^ and 
zC'J) is a supersolution. If the update to z^'^'^^ was trivial, i.e. z^^J"!) = ^^'^^^^ then there is nothing to 
prove. Therefore, for the remainder of the proof assume that the update is non-trivial (and hence Lemma 1 1 
applies). 

Since z'^'^'^^^^ and z'*^'^^ differ only in the jth coordinate, to show that z''^'^' < z^'^' given that z'^'^'-'^^^ < 
z^'^K it suffices to show that z^'^'^^ < z^^'^^^\ i.e. 

z) < z^. ' ^ zy . (28) 

As in Lemma (1 1), let us denote f\j{a; z^'"''-'^^^ by g{a). The lemma gives us a t G (0, L] such that. 



Zj - ^X/r Zj 



zf"^ = S„r ( zT'^-'^ '-1^ ^ ) . (29) 



Since z^'^'^ ^^ is a supersolution by induction hypothesis and r < L, using Lemma 6 we get 



^J < -^A/L Z^ < bx/L Z^ J < Zj 



where the second inequaHty above holds because z'^^'^ ^^ < z'^^'^ by induction hypothesis and since S^/l ° 
(I — Vf/L) is an isotone operator. The third holds since z^''^ is a supersolution (coupled with Lemma 5). 
Thus, we have proved (28). 

We now need to prove that zC'J) is a supersolution. To this end, we first claim that 



This is true since 



(fc,.-i) [yf{zi^^'-'))], (,.,.) ^ [Vf{zi^'=)] 



The first equality is true by definition of g and the second by (25). Now, applying Sx/r to both sides of (30) 
and using (29), we get 



3 - ^^/^ y^3 ~ 



For i ^ j, zi '' — zi '"' and thus we have 



we have 



(31) 



T r 

= -- [[V/(z('='^-i))], - [V/(z(''^^^))],l > 

T L J 

The last inequality holds because we have already shown that zC^'J^i) > ^C^J) and thus by isotonicity of 
I — V//L, we have 

Using the monotonic scalar shrinkage operator we have 

^\/t\Z^ \>b\/T\Z^ 

which, using the inductive hypothesis that z*^''-'^^' is a supersolution, further yields 



Combining (31) and (32), we get 



.(^^^) > Sx,r (z^^^^ - ^^^^^^ 



which proves, using Lemma 5, that z^'^'i' is a supersolution. ■ 

4.5 Comparison: CCD vs. CCM 

Theorem 13 Suppose {y^^'\ and {z^^'\ are the sequences of iterates generated by the CCD and CCM 
algorithms respectively when started from the same supersolution y^'^' — z'"-*. Then, Vfc > 0, 

z« < y^'^) . 
On the other hand, if they are started from the same subsolution y^^' — z^^' then the sequences satisfy, 

yk>o, 

z« > y^"^ . 



Proof: We will only prove the supersolution case as the subsolution case is analogous. Given that y'"^ — z^^^ 
is a supersolution, we will prove the following: if z'^^^ < y'^^^ then. 

Then the lemma follows by induction on k. Let us assume z^^"^ < y'^^^ and try to prove (33). To this end we 
will show that z^'^'-'-' < j/C^'-') by induction on j e {0, 1, ... , d}. This infers (33) since z'^^^^^ = 2;''^''''' and 

y{k+l) ^yik.d)^ 

The base case (j — 0) is true by the given condition in the lemma since 2:'^*^'°^ — z'^^^ as well as y('^^o) — 
y{k)^ Now, assume z^'^'-'"^^ < yC^'-J^i) for some j > 0. We want to show that z^'^'^^ < y'^^'^\ 

Since zC^J-i), ^C^J) and yC^'J-i),^^^^) differ only in the jth coordinate, to show that z''"'-'^ < j/^'"'-') 
given that z^^J"!) < yC^J-i)^ it suffices to show that 

zf '■'■) < yf-'^') . (34) 

If the update to z^'^'^^ is non-trivial then using Lemma 11, there is a r G (0, L], such that 

Zj - ^\/t Zj 



T 



<s„.i#"-'- '^''T""' i. ('« 



where the last inequality holds because of Lemma 6 and the fact that z^'''^ ^' is a supersolution (Lemma 12). 

;n using (24) and (12) w 

OG[V/(z('='-'"))],+Asign(zf^^')) 



(k i) Ik 7 — 1) 

If the update is trivial, i.e. zj ' = zj ^ then using (24) and (12) we have 



which coupled with (13) gives 

Zj - -^A/L yZj J^ j < bx/L yz^ 

where the last inequality is obtained by applying the isotone operator S^/l o (I — Vf/L) to the inequality 
zC'-'O) < z(k.j-i) which holds by lemma 12. Thus (35) holds irrespective of the triviaUty of the update. 

Now applying the same isotone operator to the inequality z^'''^^^> < y'-^^^^^'i and taking the jth coordi- 
nate gives, 

i=X/L \^Z^ j < bx/L [Vj J^ 

(k j) 

The right hand side above is, by definition, y- . So, combining the above with (35) gives (34) and proves 
our inductive claim. ■ 

5 Convergence Rates 

Our results so far have given inequalities comparing the iterates generated by the three algorithms. We finally 
want to compare the function values obtained by these iterates. For doing that, the next lemma is useful. 

Lemma 14 Ify is a supersolution andy < x then F{y) < F(x). 

Proof: Since F is convex, we have 

F{y) - F(x) < {Vf{y) + Xp,y - x) (36) 

for any p E d\\y\\i. We have assumed that y < x. Thus in order to prove F{y) — F{x) < 0, it suffices to 
show that 

Vi e [d], 3p, G sign(2/i) s.t. 7^ + Ap^ > (37) 

where, for convenience, we denote the gradient V f{y) by 7. Since j/ is a supersolution. Lemma 5 gives, 

V*e[d], y^>Sx,L(y^~Y) ^^^^ 

For any i e [d], there are three mutually exclusive and exhaustive cases. 



Case (1) : yi > ''''^ Plugging this value in (38) and using the definition of scalar shrinkage (9), we get 

yt>v^ — 

which gives 7i + A > and hence yi > 0. Thus, we can choose pi = 1 G sign(?;j) and we indeed have 

7i + Xpi > 0. 

Case (2) : y^ e [^, ^] In this case, we have y, > Sx/Liy't^ " 1) = 0. Thus, 

li + X . . ^ 

-^>y,>o. 

Thus we can choose p^ = 1 e sign(yi) and we have 7, + Xpi > 0. 
Case (3) : yt < ''^^ Plugging this value in (38) and using the definition of scalar shrinkage (9), we get 

yt>Vt J— 

which gives 7i — A > 0. Now if yi < 0, we can set p = — 1 G sign(yi) and will have 7^ + Xpi > 0. On 
the other hand, if yi > 0, we need to choose pi = 1 and thus 7^ + A > should hold if (37) is to be true. 
However, we know 7; — A > 0, and A > so 7,^ + A > is also true. 

Thus in all three cases we have that there is a pj G sign{yi) such that (37) is true. ■ 

There is a similar lemma for subsolutions whose proof, being similar to the proof above, is skipped. 

Lemma 15 Ify is a subsolution and y > x then F(y) < F{x). 

If we start from a supersolution, the iterates for CCD and CCM always maintain the supersolution prop- 
erty. Thus Lemma 14 ensures that starting from the same initial iterate, the function values of the CCD and 
CCM iterates always remain less than the corresponding GD iterates. Since the GD algorithm has 0{l/k) 
accuracy guarantees according to Theorem 2, the same rates must hold true for CCD and CCM. This is 
formalized in the following theorem. 

Theorem 16 Starting from the same super- or subsolution x'^^^ — y'"^ = z^^^, /ef ja;'*^^}, {y'*^-'} and {^z'^^^^ 
denote the GD, CCD and CCM iterates respectively. Then for any minimizer x* of (2), and Vfc > 1, 

T IIt* _ T-(0)||2 

i^(z('=)) < F(2/('=)) < i^(x('=)) < F{x*) + ^^ ^—^ 

2 k 

6 Conclusion 

Coordinate descent based methods have seen a resurgence of popularity in recent times in both the machine 
learning and the statistics community, due to the simplicity of the updates and implementation of the overall 
algorithms. Absence of finite time convergence rates is thus one of the most important theoretical issues to 
address. 

In this paper, we provided a comparative analysis of GD, CCD and CCM algorithms to give the first 
known finite time guarantees on the convergence rates of cyclic coordinate descent methods. However, there 
still are a significant number of unresolved questions. Our comparative results require that the algorithms 
start from a supersolution so that the property is maintained for all the subsequent iterates. We also require an 
isotonicity assumption on the I — V//L operator. Although this is a fairly common assumption in numerical 
optimization [2], it is desirable to have a more generalized analysis without any restrictions. Since stochastic 
coordinate descent [10] converges at the same 0{l/k) rate as GD without additional assumptions, intuition 
suggests that same should be true for CCD and CCM. A theoretical proof of the same remains an open 
question. 

Some greedy versions of the coordinate descent algorithm (e.g., [15]) still lack a theoretical analysis of 
their finite time convergence guarantees. Although [3] has a 0{l/k) rates for a greedy version, the analysis 
is restricted to a simplex domain and does not generalize to arbitrary domains. The phenomenal performance 
of greedy coordinate descent algorithms on real life datasets makes it all the more essential to validate these 
experimental results theoretically. 
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Appendix 

A Proof of Lemma 11 

Since g{a) = f\i{a; zC^J^i)) we have 



ffcj-i) (fej-i) 



g'{a)= v/(zi-^-^^z^^-^^...zpr^a,z}-^^^■••^r"^o 



,{k,]~l) ^_ Jk.j-1) 



Therefore, 



^IfJkj-l) 



g'{zP'-'>) = [Vf{z^''^-'% 



(39) 



Jk,j 



Since, by definition, z- is the minimizerof 5(a) + A|a|, we have 



,(fcj) 



(fej)^ 



0e5'(4 ''O + Asignlzf-^O 

For notational convenience we denote z- ' as a*, since it is the minimizerof 5(a) + A|a|. With this notation 
we have. 



^.* Jk.j-i) 



(40) 



Note that r is well defined since the denominator is non-zero by our assumption of a non-trivial update. 
Further, r > by Assumption 10 and t < L since V/ (and hence g'{a)) is L-Lipschitz continuous. 
Depending on the sign of a*, there are three possible cases: 



Case (1): a* > 0: This impHes that 
By (40), 



g'(a*) + A = G 



(41) 



Plugging this in (41), we get 



J/Jk,j-1) 



(z)^'^-'>)+r{a* -zf ''-'>) + X = 



(fcj-l)N 



Using the definition of shrinkage operator (9) combined with the fact that a* > 0, we have 

-a (^i ) - - 



a — z 



= S\/t 2 



(kj-i) 9'{Zj ) 



Case (2): a* = 0: The corresponding condition is 

Oe[g'{a*)-X,g'{a*) + X] 
Again using (40), we have 

g'ia*) = g'iz^f-'-'^) + ria* - zf'^-"^) - g'(zf '^"'^) - T{zf''-^'^) [since a* = 0] 



a* =oe 



{z. } ikj-i) A g (,z, j 



JfeJ-1) 



JfeJ-1) 



+ - 

r 



(fcj-i) 9 (Zj 






=^ a* = = Sx/t- z 

where the last step follows from the definition of the shrinkage operator (9). 
Case (3): a* < 0: This implies that 

g'{a*) - A = 




y^xj 



y = S, (x, - [Vf{x)]j) 



[Vf{x)], [Vf{x)], + A 



Figure 1 : Interval to right of zero 



[V/(x)], - A 




y ^Xj 



y = Sa {x, ~ [V/(x)],) 



[Vfix)], [V/(x)],+A 



Figure 2: Interval crossing zero 



Using (40) to substitute for g'{a*) as in the previous cases, we have. 



which yields 



ff'(zf '■'-' V T(a* - 2f ''"' V A = 



* (fe,j-i) 1 u (fej'-i)\ , ^ 
y = Zj-' ' g (z'j ) + - 



where the last inequality follows because a* < 0. 
Combining these three cases and using (39) we get 



Jk.j) _ ^ ( (kj~l) [W(^ '^ )], 

Zj - »\/t Zj - 



B Proof of lemma 5 

We prove the supersolution case only as the subsolution case is analogous. Let for a particular r > 0, 

X > S\/t ( X ^T^ ) ■ We prove the inequality for the scalar S operator on an arbitrary coordinate j. The 

subsequent proofs are divided into three disjoint cases related to the values taken by the shrinkage operator. 

Case 1 [V/(a;)]j - A > 0: 

This is illustrated in figure 1 . Depending on whether r > 1 or not, the graph of the shrinkage operator 
shifts left or right, but clearly division by r does not change the sign of the shrinkage operator value 



y^Sx (xj - [Vi{x)]A 



[Vf{x)], - A [V/(x)], 




y^xj 



Figure 3: Interval to left of zero 



at any point. As is evident from figure 1 , the graph of y = Xj always lies above that of the shrinkage 
operator. Thus 



Xj > Sx/t I x^ 



[V/(a;)] 



(42) 



for all values of t and in particular for r = 1. Thus a: is a supersolution. 

Case 2 e [[V/(x)], - A, [V/(x)], + A]: 

The corresponding case is illustrated in figure 2. It is clear from the figure that Xj > S\/r ( Xj — ^'^ ^ 

for positive t, only when Xj > 0. Just as in the previous case, changing the value of t shifts the graph 
by appropriate scale without changing its sign. Thus (42) holds for Xj > irrespective of the value of 
T. In particular, it should hold for t — I which proves that a; is a supersolution. 

Case 3 [V/(a;)]j + A < 0: 

As illustrated in figure 3, in this case the graph of the shrinkage operator will always lie below the value 
of Xj. Thus (42) will not be satisfied for any value of r which makes the case vacuous. 

To prove the converse direction, we look at the same three exclusively disjoint cases for an arbitrary 
coordinate j. 

Case 1 [V/(a;)]j - A > 0: 

As seen from figure 1, a; is always a supersolution since [Vf{x)]j + A > [\7f{x)]j — A > and the 
graph of the shrinkage operator uniformly stays below the value of Xj . Since the sign of the shrinkage 
operator value does not change due to division by r > 0, (42) holds for arbitrary positive r. 

Case 2 e [[Vf{x)]j - A, [Vf{x)], + A]: 

If X is a supersolution, it means that the value attained by the shrinkage operator lies below the value of 
Xj , which is true when Xj > (Figure 2). In this subset of the domain, division by t maintains the sign 
of the shrinkage value and thus (42) holds. 

Case3[V/(a;)]j+A<0: 

In this case the graph of the shrinkage operator always lies above the value of Xj . Thus x can never be 
a supersolution if this condition holds true. 



C Proof of lemma 6 

Let 



Hr) 



'A/t 



[V/(x)], 



We again look at the three disjoint cases for arbitrary ti,T2 G (0, cxo) with ti > T2 and show that /i(ti) > 

h{T2). 



y = Xj 




y^S,/^,{x,-'^I^y.y^S,/^,{. 



[V/(^)], 




Figure 4: Interval to right of zero 



y = Xj 







[V/(^)]j 



Figure 5: Interval crossing zero 



Case 1 [Vf{x)]j - A > 0: 

Since both the hinge points in the graph will be positive (figure 4 ), we have 



[V/WL-A ^ [Vf{x)],-X 



^jj(j i^ jy-^)\3^^ < Lvjv-^;j3-t-A ^ Thus it is trivial to see that the graph of h{Ti) is always greater than 



[V/(^)], + A ^ [V/(:^)],+A 
h{T2) 

Case 2 e [[V/(x)], - A, [V/(x)], + A]: 

Since a; needs to be a supersolution, we only need to consider the subset of the domain when Xj > 0. 
We still have i^f(^Jh+^ < 1^/(^)1^-+^ and it is obvious from figure 5, that /i(ri) > /i(t2). 

Case3 [V/(x)]j + A < : 

Since x can never be a supersolution in this case as shown in the proof of lemma 5, this case is vacuous. 



